Fix ECDH key exchange race condition in distributed deployments#111
Merged
Conversation
4f9aa32 to
65cf6a6
Compare
c70291a to
cea0f1d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix ECDH Key Exchange Race Condition in Distributed Cluster
Problem Summary
The MPC cluster fails to start properly when nodes are deployed in a distributed environment (production). Nodes get stuck in an infinite ECDH key exchange retry loop, preventing the cluster from becoming ready and accepting requests.
Observed Symptoms
Node0 (successful):
Node1 & Node2 (stuck):
Impact:
Root Cause Analysis
The Race Condition
There was a timing-dependent race condition in
pkg/mpc/registry.goat theregisterReadyPairsmethod (lines 129-134):The Bug: This code sets
r.ready = truewhen all peers are registered in Consul, but does not verify that ECDH key exchange is actually complete. The log message claims "including ECDH exchange completion" but the code never callsr.isECDHReady()to verify this.Why It Manifests in Production But Not Locally
Local Environment (Masked Bug)
When all nodes run on the same machine:
Local Timeline:
Production Environment (Exposed Bug)
When nodes are on different servers:
Production Timeline:
Why The Bug Causes an Infinite Loop
isEcdhReady=trueisEcdhReady=falseSolution
Changes Made
1. File:
pkg/mpc/registry.goChange 1.1: Refactored ready state check
Replaced inline ready check with dedicated method:
Change 1.2: Added
checkAndUpdateReadyState()methodNew method that properly validates BOTH conditions:
Key improvement: Now explicitly checks
r.isECDHReady()which verifies all symmetric keys are established.Change 1.3: Register ECDH completion callback
In
NewRegistry(), set up callback to re-check ready state when ECDH completes:Why this is critical: Handles the case where all peers are in Consul but ECDH completes later. When the last ECDH key is received, we immediately check if the cluster is now fully ready.
2. File:
pkg/mpc/key_exchange_session.goChange 2.1: Added callback mechanism to interface
Change 2.2: Added callback field to struct
Change 2.3: Implemented callback setter
Change 2.4: Trigger callback when ECDH completes
In the ECDH message handler, detect when all keys are received:
Key improvement: Immediately notify when ECDH is complete, allowing registry to update ready state.
How The Fix Works
Event-Driven Synchronization
The fix changes from timing assumptions to event-driven synchronization:
Before (timing-dependent):
After (event-driven):
Flow Comparison
Before (Broken):
After (Fixed):
Testing
Verification Steps
Build the fixed code:
Test distributed deployment:
Deploy to 3 separate servers and verify all nodes become ready:
# Check logs on each node tail -f /home/ubuntu/.pm2/logs/mpcium0-dev-error.log tail -f /home/void/.pm2/logs/mpcium1-dev-error.log tail -f /home/ubuntu/.pm2/logs/mpcium2-dev-error.logExpected logs on all nodes:
Test with artificial latency (optional):
Simulate production network conditions locally:
Test Scenarios Covered
Impact
Before
After
Additional Notes
Why This Bug Was Hard to Detect
Network Factors That Affect Production
Future Improvements
Consider adding:
Files Changed
pkg/mpc/registry.go- Fixed ready state logic and added callback handlingpkg/mpc/key_exchange_session.go- Added callback mechanism for ECDH completionBreaking Changes
None. This is a bug fix that maintains backward compatibility.
Migration Guide
No migration needed. Simply deploy the updated binaries.